AfterWork 


Text Summarization with Python 


Learning Outcomes 


By the end of this topic, you will have achieved the following learning outcomes: 


• I can understand the concept of text summarization.
• I can understand the importance of text summarization.
• I can perform text summarization to create summaries from longer text documents.


"If you want to understand people, especially your customers, then you have to be 
able to possess a strong capability to analyze text." Paul Hofmann 


Reading 


What is Text Summarization? 


Text summarization is a subdomain of Natural Language Processing (NLP) that refers 
to the techniques of creating short, accurate and fluent summaries from longer text 
documents. 


According to Inderjeet Mani in Advances in Automatic Text Summarization, text
summarization is the process of distilling the most important information from a 
source (or sources) to produce an abridged version for a particular user (or 
users) and task (or tasks). 


This concept is important in situations where we need to sift through large amounts of
redundant and insignificant data and condense it so that only the non-redundant, useful
information remains. Text summarization using Python allows us to deal with growing
amounts of data that could otherwise become overwhelming.


Why is Text Summarization Important? 


1. It helps to reduce the reading time. 
2. It helps during research, where it increases the amount of data that one can sift
through.
3. Automatic summarizers can be less biased than human summarizers.
4. It saves text storage space while retaining the original meaning. 


How does Text Summarization Work? 


The following figure describes various text summarization techniques. 
Image Source: [Link]


As shown above, text summarization methods can be classified as follows: 
a. Based on Input Type 
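i. Single Document: In this case, the summary is produced from one document,
so the input is typically short.
ii. Multi-Document: In this case, the summary is produced from several
documents, so the input can be long.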


b. Based on The Purpose 
i. Generic: In this case, the model which performs automatic text 
summarization makes no assumptions about the domain or content of the 
text to be summarized. 
ii. Domain-Specific: In this case, the model uses domain-specific 
knowledge to form a more accurate summary. 
iii. Query Based: In this case, the summary only contains information which 
answers natural language questions about the input text. 
c. Based on Output Type 
i. Extractive: This is a popular text summarization approach that we will be
exploring in depth. In this case, a subset of phrases or sentences that represent
the most important points is pulled from a piece of text and combined to make a
summary. This involves weighting the sentences, ranking them according to their
relevance and similarity with one another, and then joining the top-ranked
sentences to generate a summary.

ii. Abstractive: In this case, the model forms its own phrases and sentences
to offer a more coherent summary. Although abstraction can produce more fluent
summaries, developing its algorithms requires complicated deep learning
techniques and sophisticated language modelling.


Examples of Text Summaries 


Original text: Justus and Mary rode a horse to attend the annual event in Diani.
Along the way, Mary made a phone call to their sister Elizabeth. 


Example: Extractive summarization 
• Justus and Mary attended an event in Diani. Mary called Elizabeth.


Example: Abstraction-based summarization 
• Justus and Mary went to Diani while Mary called Elizabeth on the phone.


How does Text Summarization Work? 


As we will focus on extractive summaries, this process will involve determining which
sentences are the most important and combining them into a summary that is cohesive and
readable. We would also minimize the number of references to ideas and entities not
mentioned in the summary.

The following steps elaborate on this process:


Step 1: Performing text processing 
• This step would involve applying text processing methods, e.g. lowercasing and
removing stop words, numbers, punctuation and special characters.
• It should be noted that this step is normally dependent on the kind of text that
one might be dealing with.
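
The cleaning described in Step 1 can be sketched with the standard library alone. The stop-word list and the sample sentence here are small, hand-picked assumptions for illustration; a real project would more likely take a fuller list from a library such as NLTK.

```python
import re

# A tiny, hypothetical stop-word list used only for this illustration.
STOP_WORDS = {"a", "an", "and", "in", "the", "to", "their", "of", "on"}

def clean_text(text):
    """Lowercase the text, drop numbers and punctuation, and remove stop words."""
    text = text.lower()
    # Replace every character that is not a lowercase letter or whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    words = [word for word in text.split() if word not in STOP_WORDS]
    return " ".join(words)

# Hypothetical sample sentence (the year and "!" are added to show the cleaning).
sample = "Justus and Mary rode a horse to attend the annual event in Diani in 2020!"
cleaned = clean_text(sample)
print(cleaned)  # justus mary rode horse attend annual event diani
```

Because this cleaning also removes periods, a common arrangement is to keep the original text for splitting into sentences and to use the cleaned copy only for counting words.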


Step 2: Converting the paragraph into sentences 
• This step would involve splitting the paragraph into its individual
sentences by ending a sentence whenever a period appears.
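
A minimal sketch of Step 2, assuming that a sentence ends at a period, question mark or exclamation mark followed by whitespace. The function name `split_sentences` is made up for this example.

```python
import re

def split_sentences(paragraph):
    """Split a paragraph after '.', '!' or '?' when whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]

paragraph = (
    "Justus and Mary rode a horse to attend the annual event in Diani. "
    "Along the way, Mary made a phone call to their sister Elizabeth."
)
sentences = split_sentences(paragraph)
for sentence in sentences:
    print(sentence)
```

This simple rule would wrongly split after abbreviations such as "Dr."; a dedicated sentence tokenizer such as NLTK's `sent_tokenize` is usually more robust.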


Step 3: Performing tokenization
• This step helps us retrieve words present in the sentences.
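
Tokenization can be sketched with a regular expression that keeps runs of letters and apostrophes. This is a simplifying assumption; library tokenizers (for example NLTK's `word_tokenize`) handle many more cases.

```python
import re

def tokenize(sentence):
    """Return the lowercase word tokens found in a sentence."""
    return re.findall(r"[a-z']+", sentence.lower())

tokens = tokenize("Along the way, Mary made a phone call to their sister Elizabeth.")
print(tokens)
```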


Step 4: Finding the weighted frequency of occurrences 
• This step involves finding the weighted frequency of occurrence of every
word, with the main goal of identifying the words that carry the most weight in
the text/document.
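
The text does not define "weighted frequency" exactly. The sketch below assumes one common choice: each word's count divided by the count of the most frequent word, so the most frequent word gets a weight of 1.0. The word list is a hypothetical, already cleaned input.

```python
from collections import Counter

def weighted_frequencies(words):
    """Map each word to its count divided by the highest count (assumed weighting)."""
    counts = Counter(words)
    if not counts:
        return {}
    highest = max(counts.values())
    return {word: count / highest for word, count in counts.items()}

# Hypothetical cleaned tokens, used only to illustrate the calculation.
words = ["mary", "rode", "horse", "mary", "called", "elizabeth", "mary", "horse"]
weights = weighted_frequencies(words)
print(weights["mary"], weights["horse"], weights["rode"])
```

Here "mary" occurs three times and becomes the reference word, "horse" (two occurrences) gets 2/3, and the words that occur once get 1/3.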


Step 5: Substituting words with their weighted frequencies 
• Each word in a sentence is replaced by its weighted frequency and the
values are summed. The main goal of this step is to determine which sentence
carries the most weight in the paragraph.
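
Step 5 can be sketched as a sum over each sentence's words. The `weights` dictionary below is hypothetical and stands in for the output of Step 4; words without a weight (such as stop words) are assumed to contribute zero.

```python
import re

def score_sentences(sentences, weights):
    """Score each sentence as the sum of the weights of its words."""
    scores = {}
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        scores[sentence] = sum(weights.get(token, 0.0) for token in tokens)
    return scores

# Hypothetical weights and sentences, for illustration only.
weights = {"mary": 1.0, "horse": 0.5, "elizabeth": 0.5}
sentences = ["Mary rode a horse.", "Mary called Elizabeth.", "It was sunny."]
scores = score_sentences(sentences, weights)
for sentence, score in scores.items():
    print(f"{score:.1f}  {sentence}")
```

A plain sum favours long sentences, so some implementations normalise by sentence length or ignore very long sentences; that is a design choice not covered by the text.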


Step 6: Getting the summary 
• We then combine the sentences with the highest weighted frequencies,
subject to a threshold, to produce a summary.
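
A minimal sketch of Step 6, assuming the threshold is a maximum number of sentences (`max_sentences`, a hypothetical parameter; a minimum score would be another reasonable reading). The selected sentences are emitted in their original order so that the summary stays readable.

```python
import heapq

def build_summary(sentences, scores, max_sentences=2):
    """Join the highest-scoring sentences, preserving their original order."""
    top = heapq.nlargest(max_sentences, sentences, key=lambda s: scores.get(s, 0.0))
    chosen = set(top)
    return " ".join(sentence for sentence in sentences if sentence in chosen)

# Hypothetical sentences and scores, for illustration only.
sentences = ["Mary rode a horse.", "It was sunny.", "Mary called Elizabeth."]
scores = {
    "Mary rode a horse.": 1.5,
    "It was sunny.": 0.0,
    "Mary called Elizabeth.": 1.2,
}
summary = build_summary(sentences, scores, max_sentences=2)
print(summary)  # Mary rode a horse. Mary called Elizabeth.
```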


We should note that steps 4 and 5 involve scoring sentences based on the constructed
intermediate representation; the score reflects how well each sentence relays important
aspects of the text being summarized. Algorithms for these two steps can vary, but they
all entail assigning a score to each sentence using some metric and then selecting from
the best-scored sentences via some well-defined sentence selection method. What we
have used is a frequency-driven approach.


Other approaches include topic word approaches, latent semantic analysis (LSA) and
Bayesian topic models such as latent Dirichlet allocation (LDA).


Applications of Text Summarization 


• News: Text summarization can be used to create introductions, generate
headlines and embed captions on pictures.
• Scientific Research: It can be used to dig out important information from
scientific research papers.
• Social Media Posting: It can be used to convert long blog articles into shorter
ones suited for the audience.
• Creating Study Notes: It can be used to create student notes from a vast syllabus
and content.
• Conversation Summary: Long conversations and meeting recordings can be
converted into text, and the important information can then be extracted from them.
• Movie Plots and Reviews: Movie plots can be converted into bullet points.
• Deliverable Feeds: It can be used to generate deliverable feeds, which are short
pieces of information derived from complete articles. These are generally
delivered to people through emails or feed delivery services.